Skip to content

[client] Feature/lazy connection #3379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 110 commits into from
May 21, 2025
Merged

[client] Feature/lazy connection #3379

merged 110 commits into from
May 21, 2025

Conversation

pappz
Copy link
Contributor

@pappz pappz commented Feb 24, 2025

Describe your changes

With the lazy connection feature, the peer will connect to target peers on-demand. The trigger can be any IP traffic.

This feature can be enabled with the NB_ENABLE_EXPERIMENTAL_LAZY_CONN environment variable.

When the engine receives a network map, it binds a free UDP port for every remote peer, and the system configures WireGuard endpoints for these ports. When traffic appears on a UDP socket, the system removes this listener and starts the peer connection procedure immediately.

Key changes

  • Fix slow netbird status -d command
  • Move from engine.go file to conn_mgr.go the peer connection related code
  • Refactor the iface interface usage and moved interface file next to the engine code
  • Add new command line flag and UI option to enable feature
  • The peer.Conn struct is reusable after it has been closed.
  • Change connection states

Connection states

Idle: The peer is not attempting to establish a connection. This typically means it's in a lazy state or the remote peer is expired.

Connecting: The peer is actively trying to establish a connection. This occurs when the peer has entered an active state and is continuously attempting to reach the remote peer.

Connected: A successful peer-to-peer connection has been established and communication is active.

Issue ticket number and link

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

pappz and others added 5 commits May 5, 2025 10:46
fix idle state managmenet
Validate the threshold env variable
If the remote peer initiates the connection, we just reset the inactivity listener, but do not start it. As a result, after a disconnection, our peer never switches back to the idle state.

For every peer, we instantiate an inactivity listener. It is a reusable timer, but it must start on a go routine after it has been stopped. The key challenge is to ensure that we never start the timer multiple times and always clean it up properly when the peer is in the idle state or it has been set to the exclusion list.
Copy link
Contributor

@lixmal lixmal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description doesn't match code anymore (env vars)

if d.isClosed {
d.peerCfg.Log.Debugf("exit from activity listener")
} else {
d.peerCfg.Log.Errorf("failed to read from activity listener: %s", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not retry (continue) reading if there's an error != closed? Maybe with a ctx check in-between

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With it, we risk a high load loop. If it fails to read from UDP once, what makes us think it will succeed the second time?

Copy link
Contributor

@lixmal lixmal May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With it, we risk a high load loop.

True

If it fails to read from UDP once, what makes us think it will succeed the second time?

It could be a temporary error, e.g. OS ran out of file descriptors.
But yeah, in general this should be rare. Maybe we can add a comment here explaining the implication for the time being?

pappz added 8 commits May 15, 2025 12:53
Add an option to the management proto to enable or disable the lazy connection feature on the peer side. Command-line options and environment variables override the management configuration.

- When the feature is enabled: Start the lazy connection manager and add all peers to the manager in the "active" st-ate.

- When the feature is disabled: Stop the lazy connection manager and establish connections to all peers in the store.

Fix deadlock

The conn_mgr is protected by a mutex. However, because some internal functions are called via callbacks from the lazy connection manager, a deadlock can occur. The solution is to completely eliminate the callbacks.
lixmal
lixmal previously approved these changes May 15, 2025
lixmal
lixmal previously approved these changes May 20, 2025
lixmal
lixmal previously approved these changes May 20, 2025
@mlsmaycon mlsmaycon requested a review from Copilot May 20, 2025 08:54
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds an experimental “lazy connection” feature allowing on-demand WireGuard peer connections and refactors related engine and stats handling.

  • Introduce ConnMgr to manage lazy vs. permanent connections based on traffic or feature flag
  • Integrate ConnMgr into the engine startup/stop flows and expose CLI/config flags
  • Refactor WG stats interfaces to return a map of stats and update parsing logic

Reviewed Changes

Copilot reviewed 60 out of 60 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
client/internal/lazyconn/activity/listener.go UDP listener for on-demand connection trigger
client/internal/lazyconn/activity/listen_ip_linux.go Linux build listen IP definition
client/internal/lazyconn/activity/listen_ip.go Default listen IP for other platforms
client/internal/iface_common.go Change GetStats to return map[string]WGStats
client/internal/engine_test.go Update tests to initialize and start ConnMgr
client/internal/engine.go Wire up feature flag, engine start/stop, and peer flows
client/internal/debug/debug.go Add LazyConnectionEnabled to debug bundle output
client/internal/connect.go Propagate lazy-connection flag into EngineConfig
client/internal/conn_mgr.go New ConnMgr implementing lazy connection logic
client/internal/config.go Read/apply LazyConnectionEnabled in config
client/iface/iface.go Update GetStats signature to return map
client/iface/device/interface.go Update GetStats signature to return map
client/iface/configurer/usp_test.go Adapt tests for new stats parsing function
client/iface/configurer/usp.go Refactor stats parsing into parseTransfers & helper funcs
client/iface/configurer/kernel_unix.go Update stats method to return map of WGStats
client/cmd/up.go Set lazy flag for up command
client/cmd/status.go Extend status filter options to include `idle
client/cmd/root.go Add --enable-lazy-connection CLI flag
client/internal/dns/wgiface_windows.go Remove outdated GetStats(peerKey) from DNS iface
client/internal/dns/wgiface.go Remove outdated GetStats(peerKey) from DNS iface
Comments suppressed due to low confidence (2)

client/internal/lazyconn/activity/listener.go:33

  • [nitpick] The error message has a grammatical issue: replace "failed to creating" with "failed to create".
return nil, fmt.Errorf("failed to creating activity listener: %v", err)

client/iface/configurer/usp.go:287

  • [nitpick] The local variable TxBytes begins with an uppercase letter, which is typically reserved for exported identifiers. Rename it to txBytes to follow Go naming conventions.
TxBytes, err := toBytes(key[1])

Comment on lines +47 to +48
for {
n, remoteAddr, err := d.conn.ReadFromUDP(make([]byte, 1))
Copy link
Preview

Copilot AI May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating a new buffer on each ReadFromUDP call can be inefficient; consider reusing a single []byte buffer declared outside the loop.

Suggested change
for {
n, remoteAddr, err := d.conn.ReadFromUDP(make([]byte, 1))
buffer := make([]byte, 1) // Reusable buffer for reading packets
for {
n, remoteAddr, err := d.conn.ReadFromUDP(buffer)

Copilot uses AI. Check for mistakes.

@@ -481,7 +482,7 @@ func connectToSignal(ctx context.Context, wtConfig *mgmProto.NetbirdConfig, ourP
return signalClient, nil
}

// loginToManagement creates Management Services client, establishes a connection, logs-in and gets a global Netbird config (signal, turn, stun hosts, etc)
// loginToManagement creates Management ServiceDependencies client, establishes a connection, logs-in and gets a global Netbird config (signal, turn, stun hosts, etc)
Copy link
Preview

Copilot AI May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment refers to a "Management ServiceDependencies client", which is unclear. Consider restoring "Management Services client" or clarifying what "ServiceDependencies" refers to.

Suggested change
// loginToManagement creates Management ServiceDependencies client, establishes a connection, logs-in and gets a global Netbird config (signal, turn, stun hosts, etc)
// loginToManagement creates Management Services client, establishes a connection, logs-in and gets a global Netbird config (signal, turn, stun hosts, etc)

Copilot uses AI. Check for mistakes.

Copy link

@pappz pappz merged commit daa8380 into main May 21, 2025
35 checks passed
@pappz pappz deleted the feature/lazy-connection branch May 21, 2025 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants